title=“Final Project - ALY 6010”


Final Project Report


Intermediate Analytics

ALY 6015



Maheswar Raju Narasaiah

Prof: Eric Gero

Date: 15 January 2023


1. INTRODUCTION

In this assignment, we interpret and evaluate regression models built on the Ames Housing data.

We are going to further construct and analyze two regression models, interpret their results, and utilize diagnostic methods to identify and resolve any problems with the models.

Objectives Of Project

1: Develop and analyze regression models using established functions and diagnostic methods.

2: Address problems related to overfitting, linearity, multicollinearity and outliers.

3: Utilize automated techniques to determine the most appropriate model from a pool of multiple predictors.

About the Datasets

We are going to use Ames Housing Data. The data set contains information on 2,930 properties in Ames, Iowa, including columns related to:

  • house characteristics (bedrooms, garage, fireplace, pool, porch, etc.)
  • location (neighborhood)
  • lot information (zoning, shape, size, etc.)
  • ratings of condition and quality
  • sale price


2. ANALYSIS

2.1. Load the library and Ames housing dataset

## Load the libraries used (duplicate library() calls removed)
library(magrittr)
library(knitr)
# plyr is loaded before tidyverse/dplyr so that dplyr's verbs are not masked by plyr
library(plyr)
library(tidyverse)
library(dplyr)
library(readxl)
library(gridExtra)
library(RColorBrewer)
library(lattice)
library(ggplot2)
library(corrplot)
library(summarytools)
library(DT)
library(kableExtra)
library(DescTools)
library(qcc)
library(agricolae)
library(car)
library(psych)
library(gtools)
library(ggfortify)
library(GGally)
library(readr)
library(modelr)
library(scales)
library(lmtest)
library(olsrr)
library(leaps)
library(tibble)
library(sjPlot)
library(performance)
library(see)

# Load the data
Ames <- read_csv("~/Desktop/Intro To Analytics - ALY 6000/ALY 6000 - Project/Data Sets/AmesHousing.csv")


# Disabling scientific notation, so my graphs and outputs will be more readable:
options(scipen = 100)

2.2. Perform Exploratory Data Analysis and use descriptive statistics to describe the data.

Exploratory Data Analysis (EDA) is performed in order to better understand the underlying structure of the data, and to identify patterns, relationships, and outliers in the data set. It is an initial step in the data analysis process that helps to inform the decisions that will be made later in the analysis, such as which statistical models to use and which features to include in the models. Additionally, EDA can help to identify any issues or problems with the data, such as missing values or outliers, so that they can be addressed before modeling begins.

# 2. Perform Exploratory Data Analysis and use descriptive statistics to describe the data.
###########################################################################
# Histogram of prices
ggplot(Ames, aes(x = SalePrice)) +
    geom_histogram(color = "black", fill = "#ed610b", bins = 50) +
    labs(title = "Graph 1: Distribution of house prices", x = "Price", y = "Frequency") +
    theme_minimal()

barplot(table(Ames$"Yr Sold"),
    main = "Graph 2: When were the most houses Sold?",
    xlab = "Year",
    ylab = "Number of houses",
    col = brewer.pal(9, "Blues")
)

barplot(table(Ames$"Overall Qual"),
    main = "Graph 3: Which overall quality rating is most common?",
    xlab = "Overall quality rating",
    ylab = "Number of houses",
    col = brewer.pal(10, "RdYlBu")
)

# Histogram of living area
ggplot(Ames, aes(x = `Gr Liv Area`)) +
    geom_histogram(color = "black", fill = "#2c0ce6", bins = 30) +
    scale_x_continuous(labels = comma) +
    labs(title = "Graph 4: Distribution of Above-Ground Living Area", x = "Living area (sqft)", y = "Frequency") +
    theme_minimal()

# Let's see median prices per neighborhood
neighbourhoods <- tapply(Ames$SalePrice, Ames$Neighborhood, median)
neighbourhoods <- sort(neighbourhoods, decreasing = TRUE)

dotchart(neighbourhoods,
    pch = 21, bg = "purple1",
    cex = 0.85,
    xlab = "Median price of a house",
    main = "Graph 5: Which neighborhood is the most expensive to buy a house in?"
)


Observations

  1. Graph 1 shows that house prices are right-skewed, with the majority priced below $200,000. Prices range from $12,789 to $755,000, with a mean of $180,796 and a median of $160,000.

  2. Graph 2 shows that the most houses were sold in 2007, followed by a drop in 2008 that coincides with the subprime mortgage crisis.

  3. Graph 3 shows that most houses have an average overall quality rating, and that houses rated above average outnumber those rated below average.

  4. Graph 4 shows that most houses have an above-ground living area of less than 2,000 sqft. The mean living area is 1,500 sqft and the median is 1,442 sqft.

  5. Graph 5 uses the median rather than the mean because the median is less affected by outliers, such as a single house with an extremely high price. The graph shows that neighborhood plays a major role in house prices: the most expensive neighborhoods have median prices about three times those of the least expensive ones. Stone Brook (StoneBr) has the highest median price in Ames.
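
The summary figures quoted in these observations come from standard descriptive statistics. A minimal sketch of how they can be computed is shown below. It runs on a small made-up data frame (`toy`, with hypothetical prices and neighborhood labels) rather than on the Ames file, so the numbers it prints are illustrative only.

```r
# Hypothetical toy data standing in for the Ames data frame
toy <- data.frame(
    SalePrice    = c(105000, 172000, 244000, 189900, 129500, 410000),
    Neighborhood = c("NAmes", "NAmes", "StoneBr", "Gilbert", "OldTown", "StoneBr")
)

# Minimum, quartiles, median, mean and maximum of the sale price
print(summary(toy$SalePrice))

# Median price per neighborhood, sorted from most to least expensive
median_by_nbhd <- sort(tapply(toy$SalePrice, toy$Neighborhood, median), decreasing = TRUE)
print(median_by_nbhd)
```

On the real data, the same two calls applied to `Ames$SalePrice` and `Ames$Neighborhood` would give the range, mean and median discussed for Graph 1 and the ordering shown in Graph 5.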

2.3. Prepare the dataset for modeling by imputing missing values with the variable’s mean value

In this section, we prepare the dataset for modeling by imputing the missing values of the “Mas Vnr Area” variable with that variable’s mean.

# First, count the missing values in each column
na_count <- colSums(is.na(Ames))

na_count
##           Order             PID     MS SubClass       MS Zoning    Lot Frontage 
##               0               0               0               0             490 
##        Lot Area          Street           Alley       Lot Shape    Land Contour 
##               0               0            2732               0               0 
##       Utilities      Lot Config      Land Slope    Neighborhood     Condition 1 
##               0               0               0               0               0 
##     Condition 2       Bldg Type     House Style    Overall Qual    Overall Cond 
##               0               0               0               0               0 
##      Year Built  Year Remod/Add      Roof Style       Roof Matl    Exterior 1st 
##               0               0               0               0               0 
##    Exterior 2nd    Mas Vnr Type    Mas Vnr Area      Exter Qual      Exter Cond 
##               0              23              23               0               0 
##      Foundation       Bsmt Qual       Bsmt Cond   Bsmt Exposure  BsmtFin Type 1 
##               0              80              80              83              80 
##    BsmtFin SF 1  BsmtFin Type 2    BsmtFin SF 2     Bsmt Unf SF   Total Bsmt SF 
##               1              81               1               1               1 
##         Heating      Heating QC     Central Air      Electrical      1st Flr SF 
##               0               0               0               1               0 
##      2nd Flr SF Low Qual Fin SF     Gr Liv Area  Bsmt Full Bath  Bsmt Half Bath 
##               0               0               0               2               2 
##       Full Bath       Half Bath   Bedroom AbvGr   Kitchen AbvGr    Kitchen Qual 
##               0               0               0               0               0 
##   TotRms AbvGrd      Functional      Fireplaces    Fireplace Qu     Garage Type 
##               0               0               0            1422             157 
##   Garage Yr Blt   Garage Finish     Garage Cars     Garage Area     Garage Qual 
##             159             159               1               1             159 
##     Garage Cond     Paved Drive    Wood Deck SF   Open Porch SF  Enclosed Porch 
##             159               0               0               0               0 
##      3Ssn Porch    Screen Porch       Pool Area         Pool QC           Fence 
##               0               0               0            2917            2358 
##    Misc Feature        Misc Val         Mo Sold         Yr Sold       Sale Type 
##            2824               0               0               0               0 
##  Sale Condition       SalePrice 
##               0               0
# 3. Imputation of Mean Value in "Mas Vnr Area" Variable
#################################################################
Ames$"Mas Vnr Area"[is.na(Ames$"Mas Vnr Area")] <- mean(Ames$"Mas Vnr Area", na.rm = TRUE)
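
The imputation step can be checked on a small example. The sketch below wraps the same idea in a helper called `impute_mean` (a hypothetical name, not part of any package) and applies it to a made-up vector standing in for the “Mas Vnr Area” column.

```r
# Hypothetical helper: replace NAs in a numeric vector with the mean of the observed values
impute_mean <- function(x) {
    x[is.na(x)] <- mean(x, na.rm = TRUE)
    x
}

# Toy column standing in for Ames$`Mas Vnr Area`
mas_vnr_area <- c(0, 120, NA, 300, NA, 80)
imputed <- impute_mean(mas_vnr_area)
print(imputed)
```

Mean imputation leaves the column mean unchanged but slightly reduces its variance. With only 23 missing values out of 2,930 rows, that effect should be small here.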

2.4. Use the “cor()” function to produce a correlation matrix of the numeric values.

In this section, we use the “cor()” function to produce a correlation matrix of the numeric variables. The plot of the matrix and its interpretation follow in Section 2.5.

A correlation matrix is a table showing the correlation coefficients between multiple variables. It is an important tool in identifying which variables are related to each other and the strength of the relationship.

The correlation coefficient ranges from -1 to 1, with -1 indicating a perfect negative correlation, 0 indicating no correlation and 1 indicating a perfect positive correlation.

A correlation matrix is useful in several ways:

  • Identifying multicollinearity: When two or more independent variables are highly correlated, coefficient estimates in statistical models such as linear regression become unstable.

  • Identifying patterns in data: It can help identify which variables are related and which are not. This can be useful in feature selection and modeling.

  • Flagging suspicious relationships: A correlation matrix does not detect outliers directly, but an unexpectedly large or small coefficient can hint that a few extreme values are distorting a relationship, which can then be checked in a scatter plot.

  • Identifying potential confounding variables: It can help identify potential confounding variables in observational studies.

  • Identifying which variables to include in a model: By showing how strongly each variable is related to the response and to the other predictors, it can help determine which variables should be included in a model.

Overall, a correlation matrix is an important exploratory data analysis tool that helps to understand the relationships between different variables in a data set.

# 4. Use the "cor()" function to produce a correlation matrix of the numeric values.
###################################################################################

# Creating data subset without character variables
data.only.numeric <- Ames[, !sapply(Ames, is.character)]

only.numeric.noNA <- na.omit(data.only.numeric)



correlation.matrix <- cor(only.numeric.noNA, method = "pearson")


# Rounding off the digits in Table
table2 <- round((correlation.matrix), digits = 2)


# Present the table using the kableExtra package
knitr::kable(table2,
    caption = "Table 2: Correlation Matrix of the Numeric Variables in the Ames Housing Data",
    format = "html",
    table.attr = "style=width: 40%",
    font_size = 8
) %>%
    kable_styling(bootstrap_options = c(
        "striped", "hover",
        "condensed", "responsive"
    )) %>%
    kable_classic(
        full_width = FALSE,
        html_font = "Times New Roman"
    )
Table 2: Correlation Matrix of the Numeric Variables in the Ames Housing Data
Order Lot Frontage Lot Area Overall Qual Overall Cond Year Built Year Remod/Add Mas Vnr Area BsmtFin SF 1 BsmtFin SF 2 Bsmt Unf SF Total Bsmt SF 1st Flr SF 2nd Flr SF Low Qual Fin SF Gr Liv Area Bsmt Full Bath Bsmt Half Bath Full Bath Half Bath Bedroom AbvGr Kitchen AbvGr TotRms AbvGrd Fireplaces Garage Yr Blt Garage Cars Garage Area Wood Deck SF Open Porch SF Enclosed Porch 3Ssn Porch Screen Porch Pool Area Misc Val Mo Sold Yr Sold SalePrice
Order 1.00 0.00 0.03 -0.06 0.00 -0.07 -0.08 -0.03 -0.03 -0.02 0.00 -0.04 -0.02 0.02 -0.01 0.00 -0.04 0.02 -0.05 -0.02 0.03 -0.01 0.02 -0.01 -0.06 -0.04 -0.04 -0.01 0.02 0.03 -0.02 0.01 0.05 -0.01 0.14 -0.98 -0.04
Lot Frontage 0.00 1.00 0.49 0.20 -0.07 0.11 0.09 0.22 0.22 0.04 0.11 0.35 0.45 0.03 -0.01 0.38 0.11 -0.02 0.17 0.04 0.25 0.00 0.36 0.25 0.08 0.32 0.37 0.12 0.17 0.02 0.03 0.07 0.18 0.05 0.01 -0.01 0.35
Lot Area 0.03 0.49 1.00 0.14 -0.06 0.05 0.05 0.14 0.22 0.10 0.03 0.29 0.36 0.05 0.01 0.33 0.15 -0.02 0.14 0.06 0.16 -0.02 0.27 0.24 0.04 0.22 0.26 0.16 0.12 0.02 0.01 0.08 0.13 0.08 0.01 -0.02 0.31
Overall Qual -0.06 0.20 0.14 1.00 -0.17 0.61 0.58 0.44 0.29 -0.06 0.30 0.57 0.52 0.21 -0.04 0.59 0.18 -0.05 0.56 0.24 0.06 -0.14 0.41 0.39 0.58 0.60 0.56 0.27 0.33 -0.16 0.00 0.03 0.03 0.02 0.03 -0.01 0.80
Overall Cond 0.00 -0.07 -0.06 -0.17 1.00 -0.44 -0.01 -0.17 -0.08 0.05 -0.15 -0.21 -0.20 0.00 0.02 -0.16 -0.05 0.09 -0.26 -0.12 -0.01 -0.08 -0.12 -0.04 -0.35 -0.28 -0.25 -0.01 -0.11 0.09 0.01 0.05 -0.03 0.02 -0.01 0.03 -0.17
Year Built -0.07 0.11 0.05 0.61 -0.44 1.00 0.64 0.33 0.27 -0.04 0.17 0.43 0.34 0.00 -0.13 0.25 0.21 -0.04 0.51 0.25 -0.05 -0.12 0.14 0.15 0.83 0.54 0.48 0.23 0.24 -0.38 0.02 -0.06 0.00 -0.01 0.02 0.00 0.56
Year Remod/Add -0.08 0.09 0.05 0.58 -0.01 0.64 1.00 0.21 0.14 -0.06 0.19 0.32 0.28 0.13 -0.06 0.32 0.13 -0.06 0.49 0.18 -0.04 -0.15 0.21 0.14 0.65 0.47 0.41 0.23 0.28 -0.23 0.02 -0.05 -0.01 0.00 0.03 0.04 0.54
Mas Vnr Area -0.03 0.22 0.14 0.44 -0.17 0.33 0.21 1.00 0.32 -0.04 0.10 0.42 0.43 0.12 -0.05 0.43 0.16 -0.01 0.28 0.18 0.10 -0.02 0.32 0.28 0.27 0.37 0.39 0.18 0.14 -0.13 0.01 0.06 0.01 0.07 0.01 -0.02 0.53
BsmtFin SF 1 -0.03 0.22 0.22 0.29 -0.08 0.27 0.14 0.32 1.00 -0.05 -0.48 0.55 0.49 -0.17 -0.06 0.24 0.64 0.07 0.07 0.00 -0.12 -0.06 0.09 0.30 0.20 0.23 0.30 0.22 0.15 -0.10 0.04 0.09 0.11 0.13 0.01 0.02 0.44
BsmtFin SF 2 -0.02 0.04 0.10 -0.06 0.05 -0.04 -0.06 -0.04 -0.05 1.00 -0.25 0.07 0.05 -0.10 -0.01 -0.04 0.17 0.11 -0.08 -0.03 -0.02 -0.02 -0.07 0.04 -0.06 -0.06 -0.03 0.08 -0.01 0.02 -0.03 0.05 0.06 0.00 -0.01 0.02 -0.02
Bsmt Unf SF 0.00 0.11 0.03 0.30 -0.15 0.17 0.19 0.10 -0.48 -0.25 1.00 0.40 0.30 0.00 0.03 0.24 -0.41 -0.11 0.30 -0.04 0.16 0.03 0.23 0.01 0.20 0.25 0.22 -0.02 0.13 -0.01 -0.01 -0.04 -0.04 -0.01 0.01 -0.04 0.20
Total Bsmt SF -0.04 0.35 0.29 0.57 -0.21 0.43 0.32 0.42 0.55 0.07 0.40 1.00 0.83 -0.21 -0.03 0.47 0.33 0.01 0.35 -0.06 0.03 -0.05 0.30 0.33 0.38 0.47 0.52 0.24 0.28 -0.11 0.02 0.07 0.09 0.13 0.02 -0.01 0.65
1st Flr SF -0.02 0.45 0.36 0.52 -0.20 0.34 0.28 0.43 0.49 0.05 0.30 0.83 1.00 -0.26 -0.01 0.57 0.27 0.01 0.37 -0.11 0.07 0.07 0.39 0.40 0.31 0.48 0.53 0.24 0.27 -0.09 0.02 0.10 0.14 0.14 0.04 -0.01 0.64
2nd Flr SF 0.02 0.03 0.05 0.21 0.00 0.00 0.13 0.12 -0.17 -0.10 0.00 -0.21 -0.26 1.00 0.01 0.64 -0.17 -0.06 0.38 0.60 0.51 0.05 0.58 0.16 0.05 0.18 0.12 0.09 0.17 0.07 -0.03 0.01 0.05 -0.02 0.01 -0.04 0.25
Low Qual Fin SF -0.01 -0.01 0.01 -0.04 0.02 -0.13 -0.06 -0.05 -0.06 -0.01 0.03 -0.03 -0.01 0.01 1.00 0.09 -0.04 -0.01 0.00 -0.03 0.06 -0.02 0.08 0.01 -0.06 -0.02 -0.01 -0.01 0.01 0.10 0.00 0.02 0.05 0.00 0.01 0.02 -0.03
Gr Liv Area 0.00 0.38 0.33 0.59 -0.16 0.25 0.32 0.43 0.24 -0.04 0.24 0.47 0.57 0.64 0.09 1.00 0.07 -0.05 0.62 0.42 0.50 0.09 0.81 0.46 0.28 0.53 0.52 0.26 0.36 -0.01 -0.01 0.08 0.15 0.09 0.04 -0.04 0.71
Bsmt Full Bath -0.04 0.11 0.15 0.18 -0.05 0.21 0.13 0.16 0.64 0.17 -0.41 0.33 0.27 -0.17 -0.04 0.07 1.00 -0.13 -0.04 -0.03 -0.17 -0.02 -0.02 0.17 0.15 0.16 0.20 0.16 0.08 -0.07 0.02 0.05 0.06 0.01 0.01 0.04 0.28
Bsmt Half Bath 0.02 -0.02 -0.02 -0.05 0.09 -0.04 -0.06 -0.01 0.07 0.11 -0.11 0.01 0.01 -0.06 -0.01 -0.05 -0.13 1.00 -0.07 -0.05 0.00 -0.03 -0.06 0.04 -0.06 -0.05 -0.04 0.08 -0.05 0.00 0.04 0.02 0.09 0.05 0.01 -0.01 -0.05
Full Bath -0.05 0.17 0.14 0.56 -0.26 0.51 0.49 0.28 0.07 -0.08 0.30 0.35 0.37 0.38 0.00 0.62 -0.04 -0.07 1.00 0.14 0.33 0.12 0.52 0.23 0.51 0.53 0.45 0.18 0.29 -0.15 0.00 -0.01 0.03 -0.01 0.05 -0.01 0.56
Half Bath -0.02 0.04 0.06 0.24 -0.12 0.25 0.18 0.18 0.00 -0.03 -0.04 -0.06 -0.11 0.60 -0.03 0.42 -0.03 -0.05 0.14 1.00 0.25 -0.05 0.36 0.19 0.21 0.22 0.15 0.11 0.18 -0.07 -0.02 0.03 0.01 0.03 -0.01 -0.01 0.27
Bedroom AbvGr 0.03 0.25 0.16 0.06 -0.01 -0.05 -0.04 0.10 -0.12 -0.02 0.16 0.03 0.07 0.51 0.06 0.50 -0.17 0.00 0.33 0.25 1.00 0.19 0.65 0.09 -0.05 0.13 0.10 0.04 0.06 0.05 -0.05 0.02 0.04 0.00 0.04 -0.04 0.14
Kitchen AbvGr -0.01 0.00 -0.02 -0.14 -0.08 -0.12 -0.15 -0.02 -0.06 -0.02 0.03 -0.05 0.07 0.05 -0.02 0.09 -0.02 -0.03 0.12 -0.05 0.19 1.00 0.25 -0.09 -0.10 0.08 0.04 -0.09 -0.07 0.00 -0.02 -0.05 -0.01 -0.01 0.03 0.03 -0.11
TotRms AbvGrd 0.02 0.36 0.27 0.41 -0.12 0.14 0.21 0.32 0.09 -0.07 0.23 0.30 0.39 0.58 0.08 0.81 -0.02 -0.06 0.52 0.36 0.65 0.25 1.00 0.33 0.17 0.43 0.39 0.18 0.24 0.00 -0.04 0.03 0.08 0.07 0.04 -0.05 0.52
Fireplaces -0.01 0.25 0.24 0.39 -0.04 0.15 0.14 0.28 0.30 0.04 0.01 0.33 0.40 0.16 0.01 0.46 0.17 0.04 0.23 0.19 0.09 -0.09 0.33 1.00 0.10 0.27 0.24 0.22 0.17 -0.01 0.01 0.17 0.12 0.02 0.02 -0.01 0.46
Garage Yr Blt -0.06 0.08 0.04 0.58 -0.35 0.83 0.65 0.27 0.20 -0.06 0.20 0.38 0.31 0.05 -0.06 0.28 0.15 -0.06 0.51 0.21 -0.05 -0.10 0.17 0.10 1.00 0.60 0.58 0.24 0.25 -0.31 0.02 -0.06 -0.01 0.00 0.03 0.00 0.54
Garage Cars -0.04 0.32 0.22 0.60 -0.28 0.54 0.47 0.37 0.23 -0.06 0.25 0.47 0.48 0.18 -0.02 0.53 0.16 -0.05 0.53 0.22 0.13 0.08 0.43 0.27 0.60 1.00 0.85 0.23 0.25 -0.14 0.01 0.01 0.03 -0.01 0.06 -0.02 0.66
Garage Area -0.04 0.37 0.26 0.56 -0.25 0.48 0.41 0.39 0.30 -0.03 0.22 0.52 0.53 0.12 -0.01 0.52 0.20 -0.04 0.45 0.15 0.10 0.04 0.39 0.24 0.58 0.85 1.00 0.23 0.29 -0.10 0.01 0.04 0.06 0.03 0.04 -0.01 0.65
Wood Deck SF -0.01 0.12 0.16 0.27 -0.01 0.23 0.23 0.18 0.22 0.08 -0.02 0.24 0.24 0.09 -0.01 0.26 0.16 0.08 0.18 0.11 0.04 -0.09 0.18 0.22 0.24 0.23 0.23 1.00 0.05 -0.11 -0.04 -0.06 0.09 0.09 0.02 -0.01 0.33
Open Porch SF 0.02 0.17 0.12 0.33 -0.11 0.24 0.28 0.14 0.15 -0.01 0.13 0.28 0.27 0.17 0.01 0.36 0.08 -0.05 0.29 0.18 0.06 -0.07 0.24 0.17 0.25 0.25 0.29 0.05 1.00 -0.08 -0.01 0.07 0.06 0.11 0.05 -0.04 0.34
Enclosed Porch 0.03 0.02 0.02 -0.16 0.09 -0.38 -0.23 -0.13 -0.10 0.02 -0.01 -0.11 -0.09 0.07 0.10 -0.01 -0.07 0.00 -0.15 -0.07 0.05 0.00 0.00 -0.01 -0.31 -0.14 -0.10 -0.11 -0.08 1.00 -0.03 -0.06 0.12 0.00 -0.03 0.00 -0.14
3Ssn Porch -0.02 0.03 0.01 0.00 0.01 0.02 0.02 0.01 0.04 -0.03 -0.01 0.02 0.02 -0.03 0.00 -0.01 0.02 0.04 0.00 -0.02 -0.05 -0.02 -0.04 0.01 0.02 0.01 0.01 -0.04 -0.01 -0.03 1.00 -0.03 -0.01 0.00 0.03 0.02 0.01
Screen Porch 0.01 0.07 0.08 0.03 0.05 -0.06 -0.05 0.06 0.09 0.05 -0.04 0.07 0.10 0.01 0.02 0.08 0.05 0.02 -0.01 0.03 0.02 -0.05 0.03 0.17 -0.06 0.01 0.04 -0.06 0.07 -0.06 -0.03 1.00 0.03 0.02 0.03 -0.02 0.11
Pool Area 0.05 0.18 0.13 0.03 -0.03 0.00 -0.01 0.01 0.11 0.06 -0.04 0.09 0.14 0.05 0.05 0.15 0.06 0.09 0.03 0.01 0.04 -0.01 0.08 0.12 -0.01 0.03 0.06 0.09 0.06 0.12 -0.01 0.03 1.00 0.02 -0.06 -0.05 0.07
Misc Val -0.01 0.05 0.08 0.02 0.02 -0.01 0.00 0.07 0.13 0.00 -0.01 0.13 0.14 -0.02 0.00 0.09 0.01 0.05 -0.01 0.03 0.00 -0.01 0.07 0.02 0.00 -0.01 0.03 0.09 0.11 0.00 0.00 0.02 0.02 1.00 0.02 0.01 -0.01
Mo Sold 0.14 0.01 0.01 0.03 -0.01 0.02 0.03 0.01 0.01 -0.01 0.01 0.02 0.04 0.01 0.01 0.04 0.01 0.01 0.05 -0.01 0.04 0.03 0.04 0.02 0.03 0.06 0.04 0.02 0.05 -0.03 0.03 0.03 -0.06 0.02 1.00 -0.17 0.04
Yr Sold -0.98 -0.01 -0.02 -0.01 0.03 0.00 0.04 -0.02 0.02 0.02 -0.04 -0.01 -0.01 -0.04 0.02 -0.04 0.04 -0.01 -0.01 -0.01 -0.04 0.03 -0.05 -0.01 0.00 -0.02 -0.01 -0.01 -0.04 0.00 0.02 -0.02 -0.05 0.01 -0.17 1.00 -0.03
SalePrice -0.04 0.35 0.31 0.80 -0.17 0.56 0.54 0.53 0.44 -0.02 0.20 0.65 0.64 0.25 -0.03 0.71 0.28 -0.05 0.56 0.27 0.14 -0.11 0.52 0.46 0.54 0.66 0.65 0.33 0.34 -0.14 0.01 0.11 0.07 -0.01 0.04 -0.03 1.00

2.5. Produce a plot of the correlation matrix, and explain how to interpret it.

Interpreting a correlation matrix is fairly simple: it shows the correlation coefficients between pairs of variables in the form of a table.

  1. The diagonal elements of the matrix are always 1, as a variable is always perfectly correlated with itself.

  2. The correlation coefficient ranges from -1 to 1.

  • A coefficient of 1 indicates a perfect positive correlation, which means that as one variable increases, the other variable also increases.

  • A coefficient of -1 indicates a perfect negative correlation, which means that as one variable increases, the other variable decreases.

  • A coefficient of 0 indicates no linear relationship. The variables may still be related in a nonlinear way, so a value of 0 does not by itself mean that they are independent.

  3. Values close to 1 or -1 indicate a strong correlation, while values close to 0 indicate a weak correlation.

  4. Identify pairs of predictors whose correlation exceeds a chosen threshold in absolute value, usually 0.7 or 0.8. These pairs are potential sources of multicollinearity. In Table 2, examples are Garage Cars and Garage Area (0.85), Total Bsmt SF and 1st Flr SF (0.83), Year Built and Garage Yr Blt (0.83), and Gr Liv Area and TotRms AbvGrd (0.81).

  5. Identify predictors whose correlation with the response (SalePrice) is close to 0, such as Misc Val (-0.01). These have almost no linear relationship with the response and are unlikely to help a linear model on their own.

  6. Identify predictors whose correlation with SalePrice is strong, such as Overall Qual (0.80) and Gr Liv Area (0.71). These are good candidates for the model.

In summary, a correlation matrix helps to understand the relationships between the variables in a data set, to spot correlated predictors and so avoid multicollinearity, and to select features for a predictive model.
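
The threshold check described above can be automated rather than read off the table by eye. The following is a minimal sketch, assuming a helper named `high_cor_pairs` (hypothetical, written here for illustration) and a made-up data frame in which `x2` is deliberately constructed to track `x1`.

```r
# Hypothetical helper: list variable pairs whose absolute correlation exceeds a cutoff
high_cor_pairs <- function(cor_mat, cutoff = 0.7) {
    # upper.tri() keeps each pair once and drops the diagonal
    idx <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
    data.frame(
        var1 = rownames(cor_mat)[idx[, "row"]],
        var2 = colnames(cor_mat)[idx[, "col"]],
        r    = cor_mat[idx],
        stringsAsFactors = FALSE
    )
}

# Toy data: x2 is built to track x1 closely, x3 is unrelated noise
set.seed(1)
x1 <- rnorm(200)
toy <- data.frame(x1 = x1, x2 = x1 + rnorm(200, sd = 0.3), x3 = rnorm(200))

pairs_found <- high_cor_pairs(cor(toy))
print(pairs_found)
```

Applied to `correlation.matrix` from Section 2.4, the same helper would list the predictor pairs that deserve a multicollinearity check.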

# 5. Produce a plot of the correlation matrix, and explain how to interpret it.
###################################################################################
# Plot the correlation matrix itself (not the correlation of the correlations)
corrplot::corrplot(correlation.matrix, tl.cex = 0.5)

2.6. Make a scatter plot for the X continuous variable with the highest correlation with SalePrice.

In this section, we make a scatter plot for the continuous X variable with the highest correlation with SalePrice, a second for the X variable with the lowest correlation with SalePrice, and a third for the X variable whose correlation with SalePrice is closest to 0.5. We then interpret the scatter plots and describe how the patterns differ.

# 6. Make a scatter plot for the X continuous variable with the highest correlation with
# SalePrice. Do the same for the X variable that has the lowest correlation with SalePrice.
# Finally, make a scatter plot between X and SalePrice with the correlation closest to 0.5. Interpret the scatter plots and describe how the patterns differ.


# Variable with highest correlation with SalePrice
################################################

# Creating Objects for Analysis
YSalePrice <- c(Ames$SalePrice)
XOverallQuality <- c(Ames$`Overall Qual`)


# Using the linear Regression Formula
linearReg2.6 <- lm(YSalePrice ~ XOverallQuality)

# Creating an object to store the summary of the linear regression
SumData2.6 <- summary(linearReg2.6)

# Extracting Values and Creating Object to store the value of Intercept and Slope
Intercept2.6 <- SumData2.6$coefficients[[1]]

Slope2.6 <- SumData2.6$coefficients[[2]]

# Plotting the Scatter Plot
plot(
    YSalePrice ~ XOverallQuality,
    pch = 19,
    col = "blue",
    xlab = "Overall Quality",
    ylab = "Sales Price",
    main = "Plot 1: Linear Regression: Overall Quality and Sale Price "
)


# Adding Lines and Text in Scatter Plot
abline(linearReg2.6, col = "#99004C", lty = 2, lwd = 2) # Adding the Regression Line

abline(v = 0, lwd = 2)

abline(h = 0, lwd = 2)

# Variable with the weakest correlation with SalePrice (closest to zero)
################################################

# Creating Objects for Analysis
YSalePrice <- c(Ames$SalePrice)
XMiscVal <- c(Ames$`Misc Val`)


# Using the linear Regression Formula
linearReg2.7 <- lm(YSalePrice ~ XMiscVal)

# Creating an object to store the summary of the linear regression
SumData2.7 <- summary(linearReg2.7)

# Extracting Values and Creating Object to store the value of Intercept and Slope
Intercept2.7 <- SumData2.7$coefficients[[1]]

Slope2.7 <- SumData2.7$coefficients[[2]]

# Plotting the Scatter Plot
plot(
    YSalePrice ~ XMiscVal,
    pch = 19,
    col = "#ff6600",
    xlab = "Miscellaneous feature",
    ylab = "Sales Price",
    main = "Plot 2: Linear Regression: Miscellaneous feature and Sale Price "
)


# Adding Lines and Text in Scatter Plot
abline(linearReg2.7, col = "#99004C", lty = 2, lwd = 2) # Adding the Regression Line

abline(v = 0, lwd = 2)

abline(h = 0, lwd = 2)

# 7. Variable with a correlation closest to 0.5
###############################################


# Creating Objects for Analysis
YSalePrice <- c(Ames$SalePrice)
XTotRmsAbvGrd <- c(Ames$`TotRms AbvGrd`)


# Using the linear Regression Formula
linearReg2.8 <- lm(YSalePrice ~ XTotRmsAbvGrd)

# Creating an object to store the summary of the linear regression
SumData2.8 <- summary(linearReg2.8)

# Extracting Values and Creating Object to store the value of Intercept and Slope
Intercept2.8 <- SumData2.8$coefficients[[1]]

Slope2.8 <- SumData2.8$coefficients[[2]]

# Plotting the Scatter Plot
plot(
    YSalePrice ~ XTotRmsAbvGrd,
    pch = 19,
    col = "#f6055d",
    xlab = "Total rooms above grade",
    ylab = "Sales Price",
    main = "Plot 3: Linear Regression: Total rooms above grade and Sale Price "
)


# Adding Lines and Text in Scatter Plot
abline(linearReg2.8, col = "#99004C", lty = 2, lwd = 2) # Adding the Regression Line

abline(v = 0, lwd = 2)

abline(h = 0, lwd = 2)


Observations

  1. Plots 1 and 3 show that sale price tends to rise with both overall quality and the number of rooms above grade. The dashed line in each plot is the fitted simple linear regression of sale price on that predictor. The points follow the line more closely in Plot 1 (r = 0.80) than in Plot 3 (r = 0.52), and both plots contain some unusual observations.

  2. Plot 2 shows essentially no linear relationship between Misc Val (the dollar value of a miscellaneous feature) and sale price (r = -0.01). Most houses have a value of 0, and the fitted line is nearly flat.
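
The three variables plotted in this section can also be picked programmatically from the SalePrice row of Table 2. The sketch below copies a subset of those correlations into a named vector (`r_price`, a name introduced here for illustration) and selects the strongest, the weakest, and the one closest to 0.5.

```r
# Correlations with SalePrice, copied from the SalePrice row of Table 2 (subset of columns)
r_price <- c(
    "Overall Qual" = 0.80, "Gr Liv Area" = 0.71, "Garage Cars" = 0.66,
    "TotRms AbvGrd" = 0.52, "Fireplaces" = 0.46,
    "Overall Cond" = -0.17, "Misc Val" = -0.01
)

strongest  <- names(which.max(abs(r_price)))        # largest |r|
weakest    <- names(which.min(abs(r_price)))        # |r| closest to 0
nearest_05 <- names(which.min(abs(r_price - 0.5)))  # r closest to 0.5

print(c(strongest = strongest, weakest = weakest, nearest_05 = nearest_05))
```

Working with absolute values matters for the "weakest" choice: Overall Cond has the most negative coefficient in this subset, but Misc Val is the one closest to zero.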

2.7. Using at least 3 continuous variables, fit a regression model in R.

# 7. Using at least 3 continuous variables, fit a regression model in R.

# Creating regression model
attach(only.numeric.noNA)
Table_regression <- lm(SalePrice ~ `Garage Area` + `Gr Liv Area` + `Total Bsmt SF`)
tab_model(Table_regression)
  SalePrice
Predictors Estimates CI p
(Intercept) -41364.17 -48085.46 – -34642.88 <0.001
Garage Area 114.57 101.94 – 127.20 <0.001
Gr Liv Area 72.24 67.53 – 76.96 <0.001
Total Bsmt SF 56.50 51.20 – 61.81 <0.001
Observations 2290
R2 / R2 adjusted 0.678 / 0.677

2.8. Report the model in equation form and interpret each coefficient of the model in the context of this problem.

summary(Table_regression)
## 
## Call:
## lm(formula = SalePrice ~ `Garage Area` + `Gr Liv Area` + `Total Bsmt SF`)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -713929  -19690     748   19829  256637 
## 
## Coefficients:
##                   Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)     -41364.167   3427.478  -12.07 <0.0000000000000002 ***
## `Garage Area`      114.570      6.442   17.79 <0.0000000000000002 ***
## `Gr Liv Area`       72.245      2.405   30.04 <0.0000000000000002 ***
## `Total Bsmt SF`     56.502      2.705   20.89 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47430 on 2286 degrees of freedom
## Multiple R-squared:  0.6776, Adjusted R-squared:  0.6772 
## F-statistic:  1601 on 3 and 2286 DF,  p-value: < 0.00000000000000022

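The fitted multiple linear regression model is:

SalePrice = -41364.17 + 114.57 * Garage Area + 72.24 * Gr Liv Area + 56.50 * Total Bsmt SF

Each slope is the expected change in sale price for a one-square-foot increase in that predictor, holding the other two constant: about $114.57 per square foot of garage area, $72.24 per square foot of above-ground living area, and $56.50 per square foot of basement area. The intercept (-$41,364.17) is the predicted price of a house with zero garage, living, and basement area; it lies outside the range of the data and has no practical meaning. All coefficients are statistically significant (p < 0.001), and the model explains about 68% of the variance in sale price (adjusted R-squared = 0.677).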
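
As a worked example of the fitted equation, the sketch below plugs in a hypothetical house (500 sq ft of garage, 1,500 sq ft of living area, 1,000 sq ft of basement). The house values are invented for illustration and are not a row from the data; the coefficients are the rounded estimates reported above.

```r
# Coefficients as reported by summary(Table_regression), rounded to two decimals
coefs <- c(intercept = -41364.17, garage_area = 114.57, gr_liv_area = 72.24, total_bsmt_sf = 56.50)

# Hypothetical house (illustrative values, not taken from the data)
house <- c(garage_area = 500, gr_liv_area = 1500, total_bsmt_sf = 1000)

# Intercept plus the sum of coefficient * value for each predictor
predicted_price <- coefs[["intercept"]] + sum(coefs[names(house)] * house)
print(predicted_price)
```

The three terms contribute 57,285, 108,360 and 56,500 respectively, so the predicted price is about $180,781. With the fitted model object available, `predict()` on a one-row data frame would give the unrounded equivalent.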

2.9. Use the “plot()” function to plot your regression model.

# b. Plotting regression model
par(mfrow = c(2, 2))
plot(Table_regression)

2.10. Checking model for multicollinearity and report your findings.

There are several ways to address multicollinearity if it exists in a multiple regression analysis:

  • Remove one or more of the correlated predictor variables. This can be done by examining the correlation matrix and removing the variable with the highest correlation with the other predictors.

  • Combine correlated predictor variables into a single composite variable. This can be done using factor analysis or principal component analysis.

  • Use ridge regression or lasso regression, which are types of regularization that can reduce the standard errors of the estimates and make the model more stable.

  • Use a different model altogether, such as decision trees or random forests, which are less sensitive to multicollinearity.

It’s important to note that in practice, a combination of these methods is often employed to tackle multicollinearity.

# 10. Checking the model for multicollinearity
######################################
vif(Table_regression)
##   `Garage Area`   `Gr Liv Area` `Total Bsmt SF` 
##        1.584664        1.476125        1.490483
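
To make the VIF values less of a black box, the sketch below computes them by hand: each predictor is regressed on the remaining predictors, and the VIF is 1 / (1 - R²) of that auxiliary regression. The helper name `manual_vif` and the toy predictors `a`, `b`, `c` are assumptions made for illustration; for models with only numeric predictors this is the same quantity that `car::vif()` reports.

```r
# Hypothetical helper: VIF of each predictor, computed as 1 / (1 - R^2) from
# regressing that predictor on all the other predictors
manual_vif <- function(predictors) {
    sapply(names(predictors), function(v) {
        others <- predictors[, setdiff(names(predictors), v), drop = FALSE]
        aux <- cbind(.y = predictors[[v]], others)
        r2 <- summary(lm(.y ~ ., data = aux))$r.squared
        1 / (1 - r2)
    })
}

# Toy predictors: b is moderately correlated with a, c is unrelated noise
set.seed(42)
a <- rnorm(300)
toy <- data.frame(a = a, b = 0.6 * a + rnorm(300), c = rnorm(300))

vifs <- manual_vif(toy)
print(round(vifs, 2))
```

A VIF of 1 means a predictor is uncorrelated with the others. Values around 1.5, as reported for this model, mean the variance of each coefficient is inflated by roughly 50% relative to that ideal, which is usually considered harmless.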


Observations

  • All three VIF values are about 1.5, well below the commonly used thresholds of 5 and 10, so multicollinearity is not a concern for this model.

  • The diagnostic plots from Section 2.9 show that the model does not meet the homoscedasticity assumption: the points in the Scale-Location plot are not randomly scattered. The points in the Normal Q-Q plot also deviate from the line, although the deviation is not extreme.

  • A few outliers or atypical observations are present in both the Residuals vs Fitted plot and the Residuals vs Leverage plot.

2.11. Looking for unusual observations or outliers

# 11. Looking for unusual observations or outliers
#############################################
outlierTest(model = Table_regression)
##        rstudent                                              unadjusted p-value
## 1168 -16.452672 0.0000000000000000000000000000000000000000000000000000000014378
## 1702 -12.552210 0.0000000000000000000000000000000000536599999999999995957515797
## 1703  -8.447971 0.0000000000000000519539999999999994968174051925162646998077632
## 39     5.456008 0.0000000539519999999999989794678982493042473933542169106658548
## 1373   5.369376 0.0000000870020000000000004485083094848962836920236441073939204
## 834    5.193804 0.0000002242700000000000090895948537048076865119128342485055327
## 1368   5.013711 0.0000005749200000000000156154538084873895087412165594287216663
## 337    4.723603 0.0000024575999999999999836566497157797073214169358834624290466
## 2016  -4.402661 0.0000111859999999999992884068891751958574332093121483922004700
## 338    4.361758 0.0000134750000000000005813708889301771876034763408824801445007
##                                                      Bonferroni p
## 1168 0.0000000000000000000000000000000000000000000000000000032926
## 1702 0.0000000000000000000000000000001228799999999999954917130272
## 1703 0.0000000000001189699999999999969265483883002692554772643241
## 39   0.0001235499999999999944620687752916410317993722856044769287
## 1373 0.0001992300000000000017100210136788973613874986767768859863
## 834  0.0005135800000000000260780286254203019780106842517852783203
## 1368 0.0013166000000000000185601534141710544645320624113082885742
## 337  0.0056278999999999999165334330086807312909513711929321289062
## 2016 0.0256149999999999988808951911778422072529792785644531250000
## 338  0.0308580000000000000126565424807267845608294010162353515625
hat.plot <- function(fit) {
    p <- length(coefficients(fit)) # number of estimated parameters
    n <- length(fitted(fit)) # number of observations used in the fit
    plot(hatvalues(fit), main = "Index Plot of hat Values")
    # Reference lines at 2 and 3 times the average hat value p / n
    abline(h = c(2, 3) * p / n, col = "red", lty = 2)
    # Interactive point labelling; returns integer(0) when run non-interactively
    identify(1:n, hatvalues(fit), names(hatvalues(fit)))
}

ols_plot_cooksd_chart(Table_regression)

par(mfrow = c(1, 1))

hat.plot(Table_regression)

## integer(0)


Observations

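  • The outlier test flags ten observations with a Bonferroni-adjusted p-value below 0.05, and the Cook's distance chart and the hat-value plot show several observations above the red threshold lines. These influential or high-leverage points are candidates for removal.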

2.12. Removing unusual observations to improve the model

# 12. Eliminating unusual observations to improve model
#############################################
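cooksd <- cooks.distance(Table_regression)
# Note: the usual 4 / n cutoff takes n as the number of rows used to fit the model,
# i.e. nrow(only.numeric.noNA). The full row count used here gives a lower, stricter cutoff.
sample_size <- nrow(data.only.numeric)
influential <- as.numeric(names(cooksd)[(cooksd > (4 / sample_size))])
only.numeric.no.outliers <- only.numeric.noNA[-influential, ]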


# a. Looking at model now
attach(only.numeric.no.outliers)

Table_regression2 <- lm(SalePrice ~ `Garage Area` + `Gr Liv Area` + `Total Bsmt SF`)


par(mfrow = c(2, 2))
plot(Table_regression2)

summary(Table_regression2)
## 
## Call:
## lm(formula = SalePrice ~ `Garage Area` + `Gr Liv Area` + `Total Bsmt SF`)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -117102  -17408     770   17798   93213 
## 
## Coefficients:
##                   Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)     -39297.888   2520.187  -15.59 <0.0000000000000002 ***
## `Garage Area`      108.632      4.521   24.03 <0.0000000000000002 ***
## `Gr Liv Area`       73.264      1.749   41.88 <0.0000000000000002 ***
## `Total Bsmt SF`     55.144      1.935   28.50 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29150 on 2097 degrees of freedom
## Multiple R-squared:  0.7867, Adjusted R-squared:  0.7864 
## F-statistic:  2578 on 3 and 2097 DF,  p-value: < 0.00000000000000022
# 12. Attempt to correct any issues that you have discovered in your model. Did your changes improve the model, why or why not?


par(mfrow = c(1, 1))

hist(data.only.numeric$SalePrice)

hist(only.numeric.no.outliers$SalePrice)


Observations

  • Removing the influential observations flagged by Cook's distance improved the model's diagnostics, as the four diagnostic plots show.

  • The points on the Q-Q plot now follow the reference line closely, and the points on the Scale-Location plot are scattered without a clear pattern. This suggests that the residuals are approximately normal with roughly constant variance, so the main problems with the model were largely resolved by removing these observations.

  • The histograms of SalePrice show that the distribution changed from right-skewed before the removal to roughly symmetric and closer to normal afterwards.
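
The removal step can also be written with a logical mask, which avoids converting row labels to numeric positions. This is a hedged sketch on the built-in mtcars data rather than the Ames data; the cutoff 4 / n is the common rule of thumb, and the names keep, mtcars_trimmed, and fit_trimmed are assumptions made for illustration.

```r
# Illustrative data only: mtcars stands in for the Ames data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

cooksd <- cooks.distance(fit)  # one value per observation in the fit
n <- nobs(fit)                 # number of observations used by the model

# TRUE for observations at or below the 4 / n rule-of-thumb cutoff
keep <- cooksd <= 4 / n

# Logical indexing keeps rows aligned with the model's observations
mtcars_trimmed <- mtcars[keep, ]
fit_trimmed <- lm(mpg ~ wt + hp, data = mtcars_trimmed)

# Adjusted R-squared before and after the removal
c(before = summary(fit)$adj.r.squared,
  after  = summary(fit_trimmed)$adj.r.squared)
```

This only works as written when the model was fitted on the same data frame that is being filtered, with no rows dropped for missing values; otherwise the mask and the data frame have different lengths.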

2.13 Use the all subsets regression method to identify the “best” model

# 13. Use the all subsets regression method to identify the "best" model.
########################################################################


regfit_full <- regsubsets(SalePrice ~ ., data = only.numeric.noNA)
## Reordering variables and trying again:
# a. Looking at the model selected by subsets method
model2 <- lm(SalePrice ~ `Overall Qual` + `BsmtFin SF 1` + `Gr Liv Area`)
summary(model2)
## 
## Call:
## lm(formula = SalePrice ~ `Overall Qual` + `BsmtFin SF 1` + `Gr Liv Area`)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -126812  -16880     145   17366  124651 
## 
## Coefficients:
##                  Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)    -85588.542   2882.940  -29.69 <0.0000000000000002 ***
## `Overall Qual`  25074.378    558.461   44.90 <0.0000000000000002 ***
## `BsmtFin SF 1`     33.779      1.481   22.82 <0.0000000000000002 ***
## `Gr Liv Area`      64.928      1.720   37.75 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27160 on 2097 degrees of freedom
## Multiple R-squared:  0.8148, Adjusted R-squared:  0.8145 
## F-statistic:  3074 on 3 and 2097 DF,  p-value: < 0.00000000000000022
plot(model2)


Observations

  • The all subsets method selected a model with the predictors Overall Qual, BsmtFin SF 1, and Gr Liv Area. Its adjusted R² (0.8145) is higher than that of the model built in Section 2.12 (0.7864), which indicates a better fit. Before reaching a conclusion, however, the model's diagnostic plots were examined to check whether it meets the regression assumptions.
  • The diagnostic plots show fewer outliers and influential observations than those of the initial model, that is, the model fitted before the influential observations were removed.
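
The chunk above does not show how the selected predictors are read from the regsubsets object. One way to do this is sketched below, assuming the leaps package is installed. It uses the built-in mtcars data instead of the Ames data, caps the search at three predictors with nvmax = 3, and picks the size with the highest adjusted R²; the names subsets, s, and best_size are illustrative.

```r
library(leaps)

# Illustrative data only: mtcars stands in for the Ames data.
# nvmax = 3 limits the search to models with at most three predictors.
subsets <- regsubsets(mpg ~ ., data = mtcars, nvmax = 3)
s <- summary(subsets)

# Adjusted R-squared of the best model of each size (1, 2, 3 predictors)
print(s$adjr2)

# Size with the highest adjusted R-squared, and the terms in that model
best_size <- which.max(s$adjr2)
print(names(coef(subsets, id = best_size)))
```

The summary object also holds cp and bic, so the same pattern with which.min() can be used if a different selection criterion is preferred.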

2.14 Compare the preferred model from step 13 with your model from step 12

compare_performance(Table_regression2, model2, rank = TRUE)
## # Comparison of Model Performance Indices
## 
## Name              | Model |    R2 | R2 (adj.) |      RMSE |     Sigma | AIC weights | AICc weights | BIC weights | Performance-Score
## ------------------------------------------------------------------------------------------------------------------------------------
## model2            |    lm | 0.815 |     0.814 | 27136.955 | 27162.824 |        1.00 |         1.00 |        1.00 |           100.00%
## Table_regression2 |    lm | 0.787 |     0.786 | 29121.002 | 29148.763 |    4.12e-65 |     4.12e-65 |    4.12e-65 |             0.00%
plot(compare_performance(Table_regression2, model2, rank = TRUE))


Observations

  • The comparison ranks the model selected by the all subsets method (model2) first: it has the higher R² (0.815 versus 0.787) and the lower RMSE (27,137 versus 29,121), and it receives essentially all of the AIC, AICc, and BIC weight.

  • A plot was also created to compare the two models visually. It further confirms that the subsets model, referred to as model2, is the better choice.
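
The indices in the comparison table can be reproduced with base R, which helps in reading the very small AIC weight of the weaker model. The sketch below is an illustration on the built-in mtcars data with two arbitrary models (m_a and m_b are assumed names), not a re-run of the comparison above. The weights use the standard Akaike weight formula, exp(-Δ/2) normalised to sum to one.

```r
# Illustrative models only: mtcars stands in for the Ames data.
m_a <- lm(mpg ~ wt + hp, data = mtcars)
m_b <- lm(mpg ~ wt + qsec, data = mtcars)

comparison <- data.frame(
  model  = c("m_a", "m_b"),
  adj_r2 = c(summary(m_a)$adj.r.squared, summary(m_b)$adj.r.squared),
  rmse   = c(sqrt(mean(residuals(m_a)^2)), sqrt(mean(residuals(m_b)^2))),
  aic    = c(AIC(m_a), AIC(m_b)),
  bic    = c(BIC(m_a), BIC(m_b))
)

# Akaike weights: relative support for each model within this set
delta <- comparison$aic - min(comparison$aic)
comparison$aic_weight <- exp(-delta / 2) / sum(exp(-delta / 2))

print(comparison)
```

Because the weights depend on AIC differences exponentially, a gap of a few hundred AIC units drives the weaker model's weight to practically zero.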


3. CONCLUSIONS

  • This project showed how regression analysis can be applied to examine the associations between variables in the Ames Housing data and to gain insight from them. The same approach can be used in many industries, such as engineering, finance, and meteorology.
  • The first model was built using only continuous quantitative variables: garage area, above-grade living area, and total basement area (square feet). The second model was produced by the all subsets method, which selected overall quality, above-grade living area, and Type 1 finished basement area (square feet) as the three predictors. The main difference between the two is that the second model includes a discrete variable (overall quality is a rating), whereas the first is limited to continuous variables. Both models include above-grade living area.
  • When choosing between the models, the second was clearly better: it had a higher adjusted R² (0.8145 versus 0.7864) and the better overall performance score.
  • In conclusion, this analysis indicates that houses with higher overall quality, a larger above-grade living area, and a larger Type 1 finished basement area (square feet) tend to sell for higher prices. This information can be useful to prospective buyers and real estate agents.
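
As a worked illustration of what the model2 coefficients imply, the estimates reported in Section 2.13 can be applied to a single house. The house below is hypothetical (overall quality 6, 500 square feet of Type 1 finished basement, 1,500 square feet of above-grade living area), and the variable names are assumptions made for this sketch.

```r
# Coefficient estimates copied from the model2 summary in Section 2.13
b <- c(intercept    = -85588.542,
       overall_qual =  25074.378,
       bsmtfin_sf_1 =     33.779,
       gr_liv_area  =     64.928)

# Hypothetical house, for illustration only (the leading 1 multiplies the intercept)
house <- c(1, overall_qual = 6, bsmtfin_sf_1 = 500, gr_liv_area = 1500)

# Predicted sale price: intercept plus each coefficient times its value
predicted_price <- sum(b * house)
print(predicted_price)

# Holding the other predictors fixed, 100 extra square feet of living area
# changes the prediction by 100 times the Gr Liv Area coefficient
print(100 * b[["gr_liv_area"]])
```

The prediction is only meaningful for houses within the range of the data used to fit the model, which excludes the influential observations removed in Section 2.12.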


4. REFERENCES